Introduction and Report Objectives

Canada is the second-largest country in the world by total area, home to people from many countries and cultural backgrounds and is also the world’s tenth largest economy. As a group, and being new to Canada and coincidently to R as well, it made perfect sense to analyse the Canadian economy to gain insights about the country and to learn R and begun our journey towards becoming a Data Scientist.

By analysing the economic traits such as the variation of product prices, fluctuation in retail sales of important commodities, household spending habits, income and wealth, we hope to determine the following:

  • How the Canadian economy has evolved over time and how these traits coorelate to each other
  • To gain insights by answering questions such as -
    • How does the overall household spending changed over year and how it varied based on different parameters?
    • How does the change in product prices over years impacted sales and household spending?
    • How does educational qualifications, geography, and age dynamics influenced income?

Some considerations:

  • Common data taken for years 2010 to 2017
  • Some datasets have limited or no data for considered years
  • Accuracy of datasets

Preliminary Analysis 1 - Income Characteristics

The first thing we were interested to see how much Canadians earn and to determinew how different parameters effect it. So to begin examining the income charachteristics, we made use of Income data from 2016 Consensus and took into consideration four variables:

  • Age Groups
  • Educational Qualifications
  • Provinces (location of work)

To best represent our findings, we used a Diverging Bar plot, which not only is quite useful in presenting data especially if the dataset contains positive and negative values but also easy to interpret.

## Analysis 1 - EMP_INC_DEG.csv (Income characheteristic)

df_emp_inc_deg<-read_csv("EMP_INC_DEG.csv") %>% clean_names() %>%
  filter(!is.na(value)) %>%
  select_if(~ !any(is.na(.)))

temp_age_sal<-df_emp_inc_deg %>% filter(age_group!="Total, 15 to 64 years") %>% select(age_group,value) %>% rename(income=value)
temp_age_sal$age_group<-as.factor(temp_age_sal$age_group)
temp_age_sal<-temp_age_sal %>% group_by(age_group) %>% summarise(mean(income)) %>% clean_names()
temp_age_sal$inc_diff<-round(temp_age_sal$mean_income-mean(temp_age_sal$mean_income))
temp_age_sal$inc_type<-ifelse(temp_age_sal$inc_diff<0,"Lower than mean Income","Higher than mean Income")

temp_deg_sal<-df_emp_inc_deg %>% filter(highest_certificate_diploma_or_degree!="Total highest certificate, diploma or degree") %>% select(highest_certificate_diploma_or_degree,value) %>% rename(income=value)
temp_deg_sal$highest_certificate_diploma_or_degree<-as.factor(temp_deg_sal$highest_certificate_diploma_or_degree)
temp_deg_sal<-temp_deg_sal %>% group_by(highest_certificate_diploma_or_degree) %>% summarise(mean(income)) %>% clean_names()
temp_deg_sal$inc_diff<-round(temp_deg_sal$mean_income-mean(temp_deg_sal$mean_income))
temp_deg_sal$inc_type<-ifelse(temp_deg_sal$inc_diff<0,"Lower than mean Income","Higher than mean Income")

temp_prov_sal<-df_emp_inc_deg %>% filter(geo!="Canada") %>% select(geo,value) %>% rename(income=value)
temp_prov_sal$geo<-as.factor(temp_prov_sal$geo)
temp_prov_sal<-temp_prov_sal %>% group_by(geo) %>% summarise(mean(income)) %>% clean_names()
temp_prov_sal$inc_diff<-round(temp_prov_sal$mean_income-mean(temp_prov_sal$mean_income))
temp_prov_sal$inc_type<-ifelse(temp_prov_sal$inc_diff<0,"Lower than mean Income","Higher than mean Income")

After doing the above analysis, following things come to light:

  • For ages 30 and upwards, income generally increases as age increases.
  • Having educational qualifications, directly relates with increase in income
  • Provinces such as Nunavut, Alberta, and Northwest Territories have the highest income in comparison to densely populated provinces such as Ontario and Quebec.
  • Based on above results, it could be incurred that in order to get a higher income, it would be good if we could get a job in Northwest Territories after completing our Master’s, and attaining an age of atleast 30 years.

Preliminary Analysis 2 - Variation in Prices, Retail Sales, and Household Spending

As the next step, we wanted to examine how Prices of important goods have fluctuated over the years, how their sales have changed, and the amount Canadians have spend over the years on the same and whether there is a connection between these three parameters.

Analysis 2.1 - Household Spending

## Analysis 2.1 - HS_HST.csv (household spending based on household type)

df_hs_hst<-read_csv("HS_HST.csv") %>% clean_names() %>%
  filter(!is.na(value)) %>%
  select_if(~ !any(is.na(.))) %>%
  select(1,5,6,12,13)

temp_hst_sp<-df_hs_hst %>% filter(household_type!="All classes") %>% select(household_type,value) %>% rename(spending=value)
temp_hst_sp$household_type<-as.factor(temp_hst_sp$household_type)
temp_hst_sp<-temp_hst_sp %>% group_by(household_type) %>% summarise(mean(spending)) %>% clean_names()
temp_hst_sp$mean_spn<-round(temp_hst_sp$mean_spending-mean(temp_hst_sp$mean_spending),2)
temp_hst_sp$sp_type<-ifelse(temp_hst_sp$mean_spn<0,"Lower Spending","Higher Spending")

temp_hs_hst<-df_hs_hst %>%
  select(2,3,5) %>%
  rename(expenditure_type=household_expenditures_summary_level_categories,spending=value) %>%
  filter(household_type!="All classes") %>%
  filter(expenditure_type %in% c("Food expenditures","Shelter","Communications","Clothing and accessories","Transportation","Health care","Personal care","Recreation","Tobacco products and alcoholic beverages","Miscellaneous expenditures","Gas and other fuels (all vehicles and tools)"))

temp_hs_hst$household_type<-as.factor(temp_hs_hst$household_type)
temp_hs_hst$expenditure_type<-as.factor(temp_hs_hst$expenditure_type)
temp_hs_hst<-temp_hs_hst %>% group_by(household_type,expenditure_type) %>% summarise(mean(spending)) %>% clean_names



## Analysis 2.1 - HS_SAR.csv (household spending based on area of residence)

df_hs_sar<-read_csv("HS_SAR.csv") %>% clean_names() %>%
  filter(!is.na(value)) %>%
  select_if(~ !any(is.na(.))) %>%
  select(1,5,6,12,13)

temp_sar_sp<-df_hs_sar %>% filter(size_of_area_of_residence!="All classes") %>% select(size_of_area_of_residence,value) %>% rename(spending=value)
temp_sar_sp$size_of_area_of_residence<-as.factor(temp_sar_sp$size_of_area_of_residence)
temp_sar_sp<-temp_sar_sp %>% group_by(size_of_area_of_residence) %>% summarise(mean(spending)) %>% clean_names()
temp_sar_sp$mean_spn<-round(temp_sar_sp$mean_spending-mean(temp_sar_sp$mean_spending),2)
temp_sar_sp$sp_type<-ifelse(temp_sar_sp$mean_spn<0,"Lower Spending","Higher Spending")

temp_hs_sar<-df_hs_sar %>%
  select(2,3,5) %>%
  rename(expenditure_type=household_expenditures_summary_level_categories,spending=value) %>%
  filter(size_of_area_of_residence!="All classes") %>%
  filter(expenditure_type %in% c("Food expenditures","Shelter","Communications","Clothing and accessories","Transportation","Health care","Personal care","Recreation","Tobacco products and alcoholic beverages","Miscellaneous expenditures","Gas and other fuels (all vehicles and tools)"))

temp_hs_sar$size_of_area_of_residence<-as.factor(temp_hs_sar$size_of_area_of_residence)
temp_hs_sar$expenditure_type<-as.factor(temp_hs_sar$expenditure_type)
temp_hs_sar<-temp_hs_sar %>% group_by(size_of_area_of_residence,expenditure_type) %>% summarise(mean(spending)) %>% clean_names()

We first began with drilling into the Household Spending dataset where we hoped to determine the following:

  • What are the most expensive areas?
  • How spending is varying based on different parameters such as area of residence and type of household?

Following observations could be made based upon the above graphs:

  • Top 4 expenditure areas for Canadians were Transportation, Shelter, Food, and Clothing.
  • Population Centre with sizes upwards 250,000 tend to spend more.
  • It is interesting to see that Population Centres with size upwards 1,000,000 are right near the mean expenditure. Hence, it could be said that in order to spend lesser, it may be better to either move to population centres with size lower than 250,000 or move to centres with size upwards of 1,000,000.
  • The final result is an expected one, which is oneperson households tend to spend the lowest amount whereas, couples with children tend to spend quite a bit over the mean spending amount.

Analysis 2.2 - Pricing of Goods

While examining the pricing dataset, the biggest issue we faced was that it contained hundred of products. We couldn’t possibly use all of the goods and products and hence to make matters simpler, we referred to various online sources and decided to take only those goods/products which are quite popular in Canada. Following which, we divided the products into different categories for easier data manipulation.

Based of this graph, prices of most products saw an upward trend till year 2014 after which they either plateaued or decreased slightly. The most serious drop however was obseverd in case of fuel prices which saw a massive drop from 2014 all the way till year 2016 after which it began recovering.

Analysis 2.3 - Retail Sales of different Products

Analysis 2.3.1 - Retails Sales of different categories of Goods over years

Retail sales for the selected categories of goods mostly kept on increasing with occasional hiccups. Most sales came from Motor vehicles while the lease came from sporting goods, hobby, and music stores category.

Primary Analysis - Bringing Everything Together

After finishing priliminary analysis, we were finished with understanding and cleaning the datasets. Next we decided to see how the different economic traits [Income, Price of Goods, Retail Sales, and Household Spending] relate with each other].

Limitations Faced

  • First limitation we identified was the missing links between the datasets. The Pricing dataset, even though had hundreds of products but it could be only liked with other datasets for two categories namely Food and Fuel. Hence Sales and Spending datasets, even though they linked with each other just fine, could only link with pricing using Food and Fuel category.
  • Limited knowledge of statistical techniques to implement.

Primary Analysis 1.1 - Retail Sales vs Household Spending

Comparing trends we see that even though sales for clothing and accessories, and fuel continued to increase, there was a sharp fall in amount spent by households and it remained lower for most of 2014 to 2016 period.

temp_hst_exp<- df_hs_hst %>%
  select(1,2,3,5) %>%
  rename(expenditure_type=household_expenditures_summary_level_categories,spending=value,year=ref_date) %>%
  filter(household_type!="All classes") %>%
  filter(expenditure_type %in% c("Food expenditures","Shelter","Communications","Clothing and accessories","Transportation","Health care","Personal care","Recreation","Tobacco products and alcoholic beverages","Miscellaneous expenditures","Gas and other fuels (all vehicles and tools)")) %>% group_by(expenditure_type,year) %>% summarise(mean(spending)) %>% clean_names()
temp_hst_exp<- temp_hst_exp %>% rename(mean_spending_dollars=mean_spending)

temp_df_rsv<-df_rsv_pri %>% rename(mean_sales_millions=mean_value_in_dollar_million)
temp_df_rsv$sales_type<-as.character(df_rsv_pri$sales_type)


for (i in 1:nrow(temp_df_rsv))
{
  if(str_detect(temp_df_rsv$sales_type[i],"Furniture")==TRUE)
  {
    temp_df_rsv$sales_type[i]<-"Shelter"
  }
  else if (str_detect(temp_df_rsv$sales_type[i],"Clothing")==TRUE)
  {
    temp_df_rsv$sales_type[i]="Clothing and accessories"
  }
  else if (str_detect(temp_df_rsv$sales_type[i],"Food")==TRUE)
  {
    temp_df_rsv$sales_type[i]="Food expenditures"
  }
  else if (str_detect(temp_df_rsv$sales_type[i],"Gasoline")==TRUE)
  {
    temp_df_rsv$sales_type[i]="Gas and other fuels (all vehicles and tools)"
  }
  else if (str_detect(temp_df_rsv$sales_type[i],"Health")==TRUE)
  {
    temp_df_rsv$sales_type[i]="Health and personal care"
  }
  else if (str_detect(temp_df_rsv$sales_type[i],"Miscellaneous")==TRUE)
  {
    temp_df_rsv$sales_type[i]="Miscellaneous expenditures"
  }
  else
  {
    temp_df_rsv$sales_type[i]=temp_df_rsv$sales_type[i]
  }
}

temp_df_rsv$sales_type<-as.factor(temp_df_rsv$sales_type)

temp_rsv_food<- temp_df_rsv %>% filter(sales_type=="Food expenditures") %>% rename(mean_sf_m=mean_sales_millions)
temp_rsv_fuel<- temp_df_rsv %>% filter(sales_type=="Gas and other fuels (all vehicles and tools)") %>% rename(mean_sfu_m=mean_sales_millions)

# Merged Datasets

temp_rsv_hst<-merge(temp_df_rsv,temp_hst_exp,by.x=c("sales_type","year"),by.y=c("expenditure_type","year"))

Primary Analysis 1.2 - Fuel and Food - Price vs Sales vs Household Spending

From Pricing dataset, we had fuel and food as the common link between the three datasets. Hence we were able to go ahead with performing a linear regression to check how changes in Price affected Sales and Household Spending.

temp_sar_food<- df_hs_sar %>% filter(size_of_area_of_residence=="All classes" & household_expenditures_summary_level_categories=="Food expenditures") %>% select(3,1,5) %>% rename(spending_type=household_expenditures_summary_level_categories,year=ref_date,mean_fsp_du=value)

temp_sar_fuel<- df_hs_sar %>% filter(size_of_area_of_residence=="All classes" & household_expenditures_summary_level_categories=="Gas and other fuels (all vehicles and tools)") %>% select(3,1,5) %>% rename(spending_type=household_expenditures_summary_level_categories,year=ref_date,mean_fusp_du=value)

temp_pri_ph<-df_pri_ma %>% mutate(
  product_high=case_when(
    str_detect(products,"juice|Apples|Oranges|Bread|milk|cheese|oil|Potatoes|salmon|beef|Wieners|Tea|Coffee|drinks")==TRUE~'Food and beverage',
    str_detect(products,"gasoline")==TRUE~'Fuel',
    TRUE~'Not Needed'
  )) %>% select(1,3,4) %>% filter(product_high!="Not Needed")


temp_pri_food<- temp_pri_ph %>% filter(product_high=="Food and beverage")
temp_pri_food$product_high<-as.factor(temp_pri_food$product_high)
temp_pri_food$year<- as.factor(substring(temp_pri_food$year,1,4))
temp_pri_food$product_high<-as.factor(temp_pri_food$product_high)
temp_pri_food<-temp_pri_food %>% group_by(product_high,year) %>% summarise(sum(value_in_dollar_unit)) %>% clean_names() %>% rename(sum_fp_d_u=sum_value_in_dollar_unit)
temp_pri_food$sum_fp_d_u<-round(temp_pri_food$sum_fp_d_u,2)
temp_pri_food$year<-as.numeric(as.character(temp_pri_food$year))


temp_pri_fuel<- temp_pri_ph %>% filter(product_high=="Fuel")
temp_pri_fuel$product_high<-as.factor(temp_pri_fuel$product_high)
temp_pri_fuel$year<- as.factor(substring(temp_pri_fuel$year,1,4))
temp_pri_fuel$product_high<-as.factor(temp_pri_fuel$product_high)
temp_pri_fuel<-temp_pri_fuel %>% group_by(product_high,year) %>% summarise(mean(value_in_dollar_unit)) %>% clean_names() %>% rename(mean_fup_d_u=mean_value_in_dollar_unit)
temp_pri_fuel$mean_fup_d_u<-round(temp_pri_fuel$mean_fup_d_u,2)
temp_pri_fuel$year<-as.numeric(as.character(temp_pri_fuel$year))

## Dataset for fuel prices, sales, and expenditures

temp<- merge(temp_pri_fuel[,c(2:3)],temp_rsv_fuel[,c(2:3)],by="year")
temp_fuel<- merge(temp[,c(1:3)],temp_sar_fuel[,c(2:3)],by="year")

temp<- merge(temp_pri_fuel[,c(2:3)],temp_rsv_fuel[,c(2:3)],by="year")
temp_fuel<- merge(temp[,c(1:3)],temp_sar_fuel[,c(2:3)],by="year")
temp_fuel$year<-as.factor(temp_fuel$year)

temp<- merge(temp_pri_food[,c(2:3)],temp_rsv_food[,c(2:3)],by="year")
temp_food<- merge(temp[,c(1:3)],temp_sar_food[,c(2:3)],by="year")
temp_food$year<-as.factor(temp_food$year)

Reference

Datasets and Source

Data Source - https://open.canada.ca/en/open-data

Datasets:

Other References